feat: benchmark-agnostic evaluation protocol (BenchmarkProtocol + registry) (#107) #129
Merged
cagataycali merged 1 commit on May 13, 2026
Conversation
Introduce a `BenchmarkProtocol` ABC + string-keyed registry so every standard benchmark (LIBERO, Meta-World, RoboSuite, ManiSkill, user-authored tasks) can plug into the `SimEngine` eval loop without committing the core to any single benchmark's conventions. Adapters stay thin and ship in follow-up extras (strands-labs#108 Meta-World, strands-labs#109 RoboSuite, strands-labs#110 LIBERO).

What lands here
- `strands_robots/simulation/benchmark.py`: `BenchmarkProtocol` ABC + `StepInfo` dataclass + thread-safe string-keyed registry. Robot compatibility is first-class metadata; the default `on_episode_start` auto-loads `default_robot` and validates loaded robots against `supported_robots`.
- `strands_robots/simulation/predicates.py`: named-predicate library (`body_above_z`, `joint_above`, `distance_less_than`, `inside_region`, `contact_between`, `contact_any`, `distance_neg`, `joint_progress`, `constant`, ...). Closed registry, no `eval()` - safe for untrusted / LLM-authored specs.
- `strands_robots/simulation/benchmark_spec.py`: `DeclarativeBenchmark` + `register_benchmark_from_file` loading YAML/JSON specs (JSON via stdlib, YAML gated behind `require_optional('pyyaml')`).
- `PolicyRunner.evaluate` now accepts `spec=BenchmarkProtocol` and `seed=int` alongside the legacy `success_fn` path. The spec path adds cumulative reward, per-episode seeded RNG, `is_failure` early termination, and structured compatibility errors. The legacy path is unchanged for backcompat.
- The `SimEngine` base adds `evaluate_benchmark` / `list_benchmarks` / `register_benchmark_from_file` facades, auto-dispatched via the existing `_dispatch_action` path. The MuJoCo `tool_spec.json` gains three action enum entries plus `benchmark_name` / `spec_path` properties.

Test coverage (110 new tests, all passing)
- Protocol contract, registry ops, thread safety, compatibility errors
- Every predicate against lightweight fake sims + the reward math
- DSL validation (good/bad specs), file loading (JSON/YAML), sandboxing
- `PolicyRunner.evaluate(spec=...)` for cumulative reward, seed reproducibility, `is_success`/`is_failure`/`done` terminations, legacy backcompat, `evaluate_benchmark` / `list_benchmarks` / `register_...` facades
- MuJoCo dispatch path for all three new actions

Out of scope (tracked as follow-ups)
- Meta-World / RoboSuite / LIBERO adapters (strands-labs#108, strands-labs#109, strands-labs#110)
- BDDL parser, dense-reward curriculum tooling, RL training harness
- Replacing the existing `success_fn` path (kept working)

Refs strands-labs#107.
Contributor
Author
Requesting review from @sundargthb, @awsarron, @cagataycali 🙏
cagataycali approved these changes on May 12, 2026
Member
cagataycali
left a comment
LGTM. Reviewed the architecture, ABC contract, registry, predicates library, DSL loader, and test coverage.
What I like:
- Clean separation: `BenchmarkProtocol` ABC + `StepInfo` dataclass keep the eval loop protocol-agnostic
- Closed predicate registry (no `eval`/`exec`) makes YAML specs safe for LLM-authored input
- Thread-safe registry mirroring the `register_urdf` pattern, consistent with existing architecture
- The `on_episode_start` default validates robot compatibility and auto-loads `default_robot`, a sensible convention
- 110 tests covering contract, thread safety, DSL compilation, seed reproducibility, and MuJoCo dispatch
- Backcompat preserved: existing `success_fn` path unchanged
Architecture notes:
- The `PolicyRunner.evaluate(spec=...)` widening keeps the existing surface area intact while enabling the new benchmark-driven path: a nice dual-mode approach
- Predicate factories returning `(SimEngine) -> bool | float` is the right abstraction level for backend-agnostic evaluation
- The per-episode `rng` threading through `on_episode_start(sim, rng)` ensures reproducibility without mutable state on benchmark instances
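The per-episode RNG point above can be sketched concretely. One common way to derive independent, reproducible per-episode generators from a single master seed is NumPy's `SeedSequence.spawn`; whether the actual `PolicyRunner` uses this mechanism is an assumption, and `episode_rngs` is a hypothetical helper name.

```python
import numpy as np


def episode_rngs(seed: int, n_episodes: int) -> list[np.random.Generator]:
    """Derive one independent child RNG per episode from a master seed.

    Re-running with the same seed reproduces the exact same episode
    streams, with no mutable RNG state held on benchmark instances.
    """
    master = np.random.SeedSequence(seed)
    return [np.random.default_rng(child) for child in master.spawn(n_episodes)]
```

Each child generator can then be threaded through `on_episode_start(sim, rng)` so episode randomization is a pure function of `(seed, episode_index)`.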
This is a solid foundation for #110 (LIBERO), #108 (Meta-World), and #109 (RoboSuite) to build on.
Summary
Implements #107: a benchmark-agnostic evaluation protocol so LIBERO, Meta-World, RoboSuite, ManiSkill, and user-authored tasks can plug into the `SimEngine` eval loop without committing the core to any single benchmark's conventions. Adapters stay thin and ship in follow-up extras (#108 Meta-World, #109 RoboSuite, #110 LIBERO). The existing `success_fn` path is kept working for backcompat.

What's in scope (from #107)
- `BenchmarkProtocol` ABC + `StepInfo` dataclass in `strands_robots/simulation/benchmark.py`
- `PolicyRunner.evaluate` widened to accept `spec: BenchmarkProtocol` + `seed`, alongside the existing `success_fn` path
- `tool_spec.json` entries: `list_benchmarks`, `register_benchmark_from_file`, `evaluate_benchmark`
- `strands_robots/simulation/predicates.py` (11 built-ins: `body_above_z`, `joint_above`, `distance_less_than`, `inside_region`, `contact_between`, `contact_any`, `distance_neg`, `joint_progress`, `constant`, …)
- Spec loading (`register_benchmark_from_file`) restricted to the named-predicate DSL: no `eval`, no `exec`, safe for LLM-authored specs
- `DeclarativeBenchmark` (the DSL-driven adapter) serves as the end-to-end reference. A full `MetaWorldAdapter` is tracked as a separate PR (Benchmark adapter: Meta-World (`MetaWorldAdapter`) #108) since it needs the `metaworld` env + real task validation.

Architecture notes
- `on_episode_start` validates the loaded robot `data_config` against `supported_robots` and auto-loads `default_robot` when the sim is empty. Mismatches surface as structured error dicts.
- The registry mirrors `register_urdf`: a module-level `dict[str, BenchmarkProtocol]` guarded by an `RLock`. Re-registration is idempotent-overwrite with a warning.
- Predicates resolve through `PREDICATE_REGISTRY`. No `eval`. Sandboxed by construction.
- `evaluate(spec=…, seed=42)` seeds a master RNG, derives a per-episode child RNG, and threads it through `spec.on_episode_start(sim, rng)`.
- MuJoCo dispatch reuses the existing `_dispatch_action` path (new properties: `benchmark_name`, `spec_path`).
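The registry note above (module-level dict guarded by an `RLock`, idempotent overwrite with a warning) can be illustrated with a minimal sketch. Function and variable names here are assumptions, not the actual API in `strands_robots/simulation/benchmark.py`.

```python
import threading
import warnings

# Module-level, string-keyed registry; the RLock makes register/get/list
# safe to call concurrently (and reentrantly from within a locked call).
_REGISTRY: dict[str, object] = {}
_LOCK = threading.RLock()


def register_benchmark(name: str, benchmark: object) -> None:
    """Register under a string key; re-registration overwrites with a warning."""
    with _LOCK:
        if name in _REGISTRY:
            warnings.warn(f"overwriting existing benchmark {name!r}")
        _REGISTRY[name] = benchmark


def get_benchmark(name: str) -> object:
    with _LOCK:
        return _REGISTRY[name]


def list_benchmarks() -> list[str]:
    with _LOCK:
        return sorted(_REGISTRY)
```

This is the same shape as a typical `register_urdf`-style pattern: no global mutable state beyond the locked dict, and lookups stay O(1) by name.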
```bash
hatch run lint # ruff + mypy, all clean
hatch run pytest tests/simulation/test_benchmark.py \
tests/simulation/test_benchmark_predicates.py \
tests/simulation/test_benchmark_dsl.py \
tests/simulation/test_policy_runner_benchmark.py \
tests/simulation/mujoco/test_benchmark_dispatch.py
→ 110 passed
```
Full suite: 1347 passed, 31 skipped. The 3 pre-existing failures all require OpenGL/OSMesa (confirmed failing on `main` too).
Example
Python:
```python
sim.register_benchmark_from_file(benchmark_name='drawer-open', spec_path='specs/drawer.yaml')
sim.evaluate_benchmark(benchmark_name='drawer-open', robot_name='arm', policy_provider='mock', n_episodes=10, seed=42)
```
Spec (`specs/drawer.yaml`):
```yaml
name: drawer-open
max_steps: 300
supported_robots: [panda]
default_robot: panda
success:
all:
- {predicate: joint_above, joint: drawer_slide, value: 0.15}
failure:
any:
- {predicate: body_below_z, body: gripper, z: -0.1}
dense_reward:
```
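To show how a spec like the one above can compile into callables without `eval`, here is a sketch of a closed predicate registry and a recursive `all`/`any` compiler. The registration decorator, `compile_condition`, and the `sim.joint_pos` accessor are all hypothetical; the real signatures live in `strands_robots/simulation/predicates.py` and may differ.

```python
from typing import Any, Callable

# Closed registry: unknown predicate names fail fast with a KeyError,
# so untrusted specs can never reach eval/exec or arbitrary attributes.
PREDICATE_REGISTRY: dict[str, Callable[..., Callable[[Any], bool]]] = {}


def predicate(name: str):
    """Decorator registering a predicate factory under a string name."""
    def deco(factory):
        PREDICATE_REGISTRY[name] = factory
        return factory
    return deco


@predicate("joint_above")
def joint_above(joint: str, value: float):
    # Factory returns a (sim) -> bool closure, keeping evaluation backend-agnostic.
    return lambda sim: sim.joint_pos(joint) > value


def compile_condition(node: dict) -> Callable[[Any], bool]:
    """Recursively compile {all: [...]}/{any: [...]} trees of named predicates."""
    if "all" in node:
        subs = [compile_condition(n) for n in node["all"]]
        return lambda sim: all(f(sim) for f in subs)
    if "any" in node:
        subs = [compile_condition(n) for n in node["any"]]
        return lambda sim: any(f(sim) for f in subs)
    kwargs = dict(node)
    factory = PREDICATE_REGISTRY[kwargs.pop("predicate")]
    return factory(**kwargs)
```

For the `success:` block in the YAML above, `compile_condition({"all": [{"predicate": "joint_above", "joint": "drawer_slide", "value": 0.15}]})` yields a single `(sim) -> bool` callable the eval loop can invoke every step.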
Out of scope (tracked as follow-ups)
Closes #107.